Add support for advanced.config #221
Conversation
- related to issue #219
- configured through `spec.rabbitmq.advancedConfig` as a string
- updating advancedConfig will trigger a StatefulSet rolling restart because, similar to `rabbitmq.conf`, advanced config is loaded at node start
@@ -4476,6 +4472,10 @@ spec:
          type: string
        maxItems: 100
        type: array
      advancedConfig:
        description: Specify any rabbitmq advanced.config configurations
        maxLength: 2000
Is this limit required? 2000 characters can be exceeded in environments with extensive LDAP queries and OAuth 2 settings, for example.
It's not required. There is a hard limit in Kubernetes: all objects are stored in etcd, and etcd has a limit of 1MB per object, which translates to 1048576 characters. I will update the limit to 100000. Thanks for pointing it out, Michael!
I left a cosmetic suggestion, nothing that stops the ✅ 👌
internal/resource/configmap_test.go (outdated)

advancedConfig, ok := configMap.Data["advanced.config"]
Expect(ok).To(BeTrue())
Cosmetic change request 🙃 Let's annotate the assertion so that if this ever fails, we get a message more descriptive than "expected false to be true"
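For illustration, a minimal sketch of the kind of annotation meant here, using Gomega's optional failure-description argument (the message wording is just an example):

```go
advancedConfig, ok := configMap.Data["advanced.config"]
// The extra argument is Gomega's optional description, printed if the assertion fails.
Expect(ok).To(BeTrue(), "expected generated ConfigMap data to contain an 'advanced.config' key")
```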
Updated in the new commit ;) good call Aitor
Working on adding syntax validation before updating.
* Add client service override `spec.override.clientService`
  - this closes #142
  - add an additional port example using the client service override
  - make sts and svc override's name and namespace not configurable
Pinning to a specific SHA no longer makes sense with no OSL process.
In the "failure" scenario where an invalid advanced config is provided, I don't think we are loud enough about surfacing that there is a problem. I think that updating the status condition and logging an event is not enough, because the user has to decide to take the additional step of describing the RabbitmqCluster resource.

In the current implementation, I can't think of any motivation for the user to take a diagnostic step, given that all pods are running and "everything just works". It'll be at a later time, when RabbitMQ is not doing what's expected (due to the advanced config not being loaded), that a user would suspect something is not right. And even at that point, I don't think that describing the custom resource is a step one would think of taking first.

I think this safeguard is protecting availability in a scenario where availability is not expected, i.e. a single RabbitMQ node. In a single-node scenario, the user was expecting to lose availability due to the node restart to apply the advanced config. I don't think that preventing the restart (when it was expected) and never reconciling again (due to #225) are a better alternative to letting RMQ crash due to invalid configuration.

In the scenario of a 3-node cluster, the StatefulSet controller will preserve availability of the cluster because it won't roll the surviving 2 replicas until the first one is ready, which will never happen due to the invalid advanced config.
The latest changes have modified the behaviour of the Operator and the original review is no longer valid.
You are totally right that the implementation here is far from ideal. This is part of the reason why I think it is important for us to figure out a more reliable way to load updated configurations, and why I created issue #225. In my opinion, it seems very easy for anyone to make syntax mistakes with their advanced config.

This all depends on how much we want to prevent a server crash. If a server crash is something users are totally ok with and are used to handling, removing the validation and letting it happen makes sense to me. However, if it is a failure mode that causes pain for users, I don't see why we wouldn't add an additional validation to avoid arriving there. I would like to hear others' opinions on this subject as well. Thoughts? @mkuratczyk @j4mcs
I think that a node crash, when a user was expecting downtime, does not cause panic or extra pain. The user was expecting downtime anyway. In my opinion, a Pod that enters a crash loop backoff is a more evident, direct, right-in-your-face symptom that there is a problem. A Pod that is just running, when you expected it to roll, doesn't make it so evident that there is a problem. Maybe the change did not require a Pod restart? Plugins do not restart Pods, so maybe this change doesn't either? Maybe the update has not been scheduled yet? It's not evident unless the user takes further actions.
I've been wondering how true this statement is in the outside world. For us, it is very clear. I have not yet heard any mention of conditions, or of the CR status, in any feedback communication we have had so far. I've been wondering for a while whether status conditions aren't just something we convinced ourselves "is the way". Even when I'm troubleshooting a problem during development, I rarely check status conditions or status objects unless I have a good reason to; instead I find myself looking elsewhere first.
Phrasing my argument another way: what does a server crash cause? Downtime. A single RabbitMQ node was expecting downtime, so we are not avoiding an unexpected event. In a 3-node cluster, what does one Pod in crash loop backoff cause? Degraded availability, but no downtime; service continuity remains intact. I definitely think that a Pod in crash loop backoff is a much better option than silently masking the error in a status condition.
I see good points on both sides and haven't made my mind up yet. However, aren't validation webhooks the right tool for this job? If we just rejected changes to the CR when the advanced config is invalid, the problem would surface immediately.
@Zerpet @mkuratczyk validation webhooks would be a more ideal place for this. I've deleted the validation logic. The branch history looks a bit messy at the moment, so I will squash merge.
I'm a bit wary of spending too much time in K8s concerned with a RabbitMQ fault. Ultimately, if the user misconfigures their RabbitMQ cluster, the failure mode will be determined by RabbitMQ. I feel like we're trying to patch the behaviour of the underlying technology with reconciliation logic. If an invalid advanced config should result in a spectacular failure of the RabbitMQ cluster, then that's the core's responsibility.

I agree that we've made a conscious, opinionated decision that conditions are where we surface underlying faults. But I don't see an issue with that. Ultimately, we've taken a cue from Knative and it's definitely not an established standard. I worry that unless we let misconfigurations like these run their course, we're going to have more and more edge cases bloating reconciliation.

All that to say I'm happy with the code as it is. I'd be up for exploring a validating webhook, but I think it needs to be opt-in, not least because it will mean loading Erlang to do syntax checking.
I agree and resonate with James' comment. I'm happy with the current state and in favour of exploring validating webhooks.
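For context on what exploring a validating webhook could look like, here is a minimal, hypothetical sketch against controller-runtime's `webhook.Validator` interface. The receiver type, the `Spec.Rabbitmq.AdvancedConfig` field path, and the shallow "looks like an Erlang term list" check are assumptions for illustration; proper syntax checking would need an Erlang runtime, as noted above.

```go
package v1beta1

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/runtime"
)

// The RabbitmqCluster type is assumed to be defined elsewhere in this package.

// ValidateCreate implements controller-runtime's webhook.Validator.
func (r *RabbitmqCluster) ValidateCreate() error { return r.validateAdvancedConfig() }

// ValidateUpdate implements webhook.Validator; old is the previous object state.
func (r *RabbitmqCluster) ValidateUpdate(old runtime.Object) error { return r.validateAdvancedConfig() }

// ValidateDelete implements webhook.Validator; nothing to check on delete.
func (r *RabbitmqCluster) ValidateDelete() error { return nil }

// validateAdvancedConfig is only a shallow sanity check: a non-empty advanced.config
// is an Erlang term file, so it should start with '[' and end with "].".
func (r *RabbitmqCluster) validateAdvancedConfig() error {
	cfg := strings.TrimSpace(r.Spec.Rabbitmq.AdvancedConfig)
	if cfg == "" {
		return nil
	}
	if !strings.HasPrefix(cfg, "[") || !strings.HasSuffix(cfg, "].") {
		return fmt.Errorf("spec.rabbitmq.advancedConfig does not look like an Erlang term list ending in '].'")
	}
	return nil
}
```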
In my opinion, the benefit our operator can provide is an improvement in the experience of operating core RabbitMQ on K8s. The operator should make deploying and running RabbitMQ on K8s easier for users than building their own solution. If we were to say that experience improvements do not belong in the operator, then how are we offering anything different from the templating that the Bitnami chart has been offering for two years already? I don't know the answer to this question, nor do I have a clean definition of what counts as a reasonable experience improvement. However, I disagree with the statement that we should avoid patching RabbitMQ behaviour with our operator.

The PR is merged. I'm fine with the next step that is proposed. We can chat about the responsibility boundary of the operator some other time. This will definitely be a topic that comes up repeatedly.
* Add support for advanced.config
  - related to issue #219
  - configured through `spec.rabbitmq.advancedConfig` as a string
  - updating advancedConfig will trigger a StatefulSet rolling restart because, similar to `rabbitmq.conf`, advanced config is loaded at node start
  - set maxLength for advancedConfig to 100000
This closes #219
Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Summary Of Changes

- Add support for advanced.config, configured through `spec.rabbitmq.advancedConfig` as a string
- Updating advancedConfig will trigger a StatefulSet rolling restart because, similar to `rabbitmq.conf`, advanced config is loaded at node start
- Before restarting the StatefulSet, the controller runs `rabbitmqctl eval <advanced.config>` to validate that the input has correct syntax. `rabbitmqctl eval` evaluates the input, so it's not a minimal syntax validation. Some configurations in `advanced.config` can be loaded while rabbit is running, and some can only be loaded at node start. I have looked into `erl -noshell -eval` and writing an escript, but could not figure out how to use them successfully, so I settled on using `rabbitmqctl eval`.

Additional Context
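As a minimal sketch of the validation step described above (assuming a hypothetical helper that shells out to a locally reachable `rabbitmqctl`; this is illustrative, not the code in this PR):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// validateAdvancedConfig hands the advanced.config content to `rabbitmqctl eval`
// and treats a non-zero exit status as a failure (which includes syntax errors,
// but also e.g. an unreachable node).
func validateAdvancedConfig(content string) error {
	out, err := exec.Command("rabbitmqctl", "eval", content).CombinedOutput()
	if err != nil {
		return fmt.Errorf("advanced config rejected by rabbitmqctl eval: %s: %w", strings.TrimSpace(string(out)), err)
	}
	return nil
}

func main() {
	// A well-formed advanced.config is a single Erlang term list terminated by a period.
	if err := validateAdvancedConfig("[{rabbit, [{channel_max, 1050}]}]."); err != nil {
		fmt.Println(err)
	}
}
```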
Local Testing
Have run unit, integration and system tests.
This is tested at the unit and system test levels.